
feat: Add optional lz4 compression support for arrays passed via base64 or binref encoding #579

Open: angela-ko wants to merge 23 commits into main from ako/compression

@angela-ko angela-ko commented Apr 30, 2026

Contributor

Relevant issue or PR

To be done prior to implementing in pasteur-types

The changes are essentially identical to those in:
https://github.com/pasteurlabs/pasteur-types/pull/358/changes

Following the design doc linked below, I chose to start with lz4 as the minimal-dependency option for compression; more optional compression types can be added once this works.
https://pasteurisi.atlassian.net/wiki/spaces/~71202060d9f9d7be6c427dafac7d77e930e293/pages/1191247903/Compression+-+Design+Options

Description of changes

  • Add lz4 as an optional dependency
  • Add compress/decompress to `array_encoding` and `output_to_bytes`
  • Update the CLI and `tesseract.py` to support compression
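
For readers skimming the thread, here is a minimal sketch of what an optional compress/decompress step around base64 encoding could look like. It is not the PR's implementation: `encode_base64`, `decode_base64`, the payload keys, and the zlib fallback (used only so the sketch runs when the optional `lz4` package is absent) are all assumptions made for illustration.

```python
import base64
import zlib
from array import array

try:  # lz4 is the optional dependency this PR introduces
    import lz4.frame as _lz4_frame
except ImportError:  # assumption: fall back to zlib so the sketch still runs
    _lz4_frame = None

# Name of the codec actually in use, recorded in the payload.
CODEC = "lz4" if _lz4_frame is not None else "zlib"


def _compress(raw: bytes) -> bytes:
    return _lz4_frame.compress(raw) if _lz4_frame is not None else zlib.compress(raw)


def _decompress(blob: bytes) -> bytes:
    return _lz4_frame.decompress(blob) if _lz4_frame is not None else zlib.decompress(blob)


def encode_base64(values: array, compress: bool = False) -> dict:
    """Serialize an array to a base64 payload, optionally compressing first."""
    raw = values.tobytes()
    if compress:
        raw = _compress(raw)
    return {
        "encoding": "base64",
        "compression": CODEC if compress else None,
        "buffer": base64.b64encode(raw).decode("ascii"),
    }


def decode_base64(payload: dict, typecode: str = "d") -> array:
    """Invert encode_base64: base64-decode, then decompress if flagged."""
    raw = base64.b64decode(payload["buffer"])
    if payload["compression"] is not None:
        raw = _decompress(raw)
    out = array(typecode)
    out.frombytes(raw)
    return out


values = array("d", [0.0] * 1000)
plain = encode_base64(values)
packed = encode_base64(values, compress=True)
```

Compression runs before base64 encoding, so the size saving carries through to the text payload; highly repetitive data such as the zeros above shrinks substantially.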

Testing done

Unit testing

@angela-ko

Contributor Author

@dionhaefner @nmheim Is this what you meant by testing compression in Tesseract?

@angela-ko angela-ko marked this pull request as ready for review April 30, 2026 18:01
@dionhaefner

Contributor

That's a good start, thanks @angela-ko! As a next step, please add minimal, meaningful end-to-end tests that cover this functionality, which I expect are going to fail because I do see some issues with how the new lz4 dependency is added :)

Once everything is passing end-to-end I'll have a closer look at the design choices here.

@dionhaefner

Contributor

And please outline your rationale for choosing lz4 specifically as part of the PR body.

@angela-ko angela-ko marked this pull request as draft May 11, 2026 02:31

codecov Bot commented May 11, 2026


Codecov Report

❌ Patch coverage is 75.75758% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.84%. Comparing base (0e08e21) to head (1f7ac9d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tesseract_core/runtime/array_encoding.py | 76.31% | 7 Missing and 2 partials ⚠️ |
| tesseract_core/sdk/tesseract.py | 68.42% | 3 Missing and 3 partials ⚠️ |
| tesseract_core/runtime/cli.py | 83.33% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #579      +/-   ##
==========================================
+ Coverage   68.30%   77.84%   +9.54%     
==========================================
  Files          39       39              
  Lines        4635     4690      +55     
  Branches      754      770      +16     
==========================================
+ Hits         3166     3651     +485     
+ Misses       1224      727     -497     
- Partials      245      312      +67     

PasteurBot commented May 11, 2026

Contributor

Benchmark Results

ℹ️ No baseline found — all benchmarks marked as new.

Benchmarks use a no-op Tesseract to measure pure framework overhead.

| Benchmark | Baseline | Current | Change | Status |
|---|---|---|---|---|
| api/apply_1,000 | - | 0.581ms | new | 🆕 |
| api/apply_100,000 | - | 0.583ms | new | 🆕 |
| api/apply_10,000,000 | - | 0.581ms | new | 🆕 |
| cli/apply_1,000 | - | 1704.817ms | new | 🆕 |
| cli/apply_100,000 | - | 1830.680ms | new | 🆕 |
| cli/apply_10,000,000 | - | 2017.813ms | new | 🆕 |
| decoding/base64_1,000 | - | 0.037ms | new | 🆕 |
| decoding/base64_100,000 | - | 0.531ms | new | 🆕 |
| decoding/base64_10,000,000 | - | 67.930ms | new | 🆕 |
| decoding/base64+lz4_1,000 | - | 0.040ms | new | 🆕 |
| decoding/base64+lz4_100,000 | - | 0.572ms | new | 🆕 |
| decoding/base64+lz4_10,000,000 | - | 115.237ms | new | 🆕 |
| decoding/binref_1,000 | - | 0.203ms | new | 🆕 |
| decoding/binref_100,000 | - | 0.240ms | new | 🆕 |
| decoding/binref_10,000,000 | - | 11.089ms | new | 🆕 |
| decoding/binref+lz4_1,000 | - | 0.210ms | new | 🆕 |
| decoding/binref+lz4_100,000 | - | 0.290ms | new | 🆕 |
| decoding/binref+lz4_10,000,000 | - | 40.529ms | new | 🆕 |
| decoding/json_1,000 | - | 0.107ms | new | 🆕 |
| decoding/json_100,000 | - | 9.102ms | new | 🆕 |
| decoding/json_10,000,000 | - | 1077.563ms | new | 🆕 |
| encoding/base64_1,000 | - | 0.042ms | new | 🆕 |
| encoding/base64_100,000 | - | 0.149ms | new | 🆕 |
| encoding/base64_10,000,000 | - | 29.679ms | new | 🆕 |
| encoding/base64+lz4_1,000 | - | 0.048ms | new | 🆕 |
| encoding/base64+lz4_100,000 | - | 0.348ms | new | 🆕 |
| encoding/base64+lz4_10,000,000 | - | 93.115ms | new | 🆕 |
| encoding/binref_1,000 | - | 0.316ms | new | 🆕 |
| encoding/binref_100,000 | - | 0.491ms | new | 🆕 |
| encoding/binref_10,000,000 | - | 20.579ms | new | 🆕 |
| encoding/binref+lz4_1,000 | - | 0.325ms | new | 🆕 |
| encoding/binref+lz4_100,000 | - | 0.705ms | new | 🆕 |
| encoding/binref+lz4_10,000,000 | - | 84.605ms | new | 🆕 |
| encoding/json_1,000 | - | 0.152ms | new | 🆕 |
| encoding/json_100,000 | - | 13.522ms | new | 🆕 |
| encoding/json_10,000,000 | - | 1417.693ms | new | 🆕 |
| http/apply_1,000 | - | 3.121ms | new | 🆕 |
| http/apply_100,000 | - | 9.025ms | new | 🆕 |
| http/apply_10,000,000 | - | 788.311ms | new | 🆕 |
| roundtrip/base64_1,000 | - | 0.088ms | new | 🆕 |
| roundtrip/base64_100,000 | - | 0.696ms | new | 🆕 |
| roundtrip/base64_10,000,000 | - | 94.201ms | new | 🆕 |
| roundtrip/base64+lz4_1,000 | - | 0.099ms | new | 🆕 |
| roundtrip/base64+lz4_100,000 | - | 0.938ms | new | 🆕 |
| roundtrip/base64+lz4_10,000,000 | - | 211.874ms | new | 🆕 |
| roundtrip/binref_1,000 | - | 0.539ms | new | 🆕 |
| roundtrip/binref_100,000 | - | 0.739ms | new | 🆕 |
| roundtrip/binref_10,000,000 | - | 32.002ms | new | 🆕 |
| roundtrip/binref+lz4_1,000 | - | 0.553ms | new | 🆕 |
| roundtrip/binref+lz4_100,000 | - | 1.008ms | new | 🆕 |
| roundtrip/binref+lz4_10,000,000 | - | 129.190ms | new | 🆕 |
| roundtrip/json_1,000 | - | 0.272ms | new | 🆕 |
| roundtrip/json_100,000 | - | 20.142ms | new | 🆕 |
| roundtrip/json_10,000,000 | - | 2476.531ms | new | 🆕 |
Benchmark details
  • Runner: Linux 6.17.0-1018-azure x86_64

@angela-ko angela-ko force-pushed the ako/compression branch 2 times, most recently from 56294af to 17fb949 Compare May 11, 2026 18:21
@angela-ko angela-ko force-pushed the ako/compression branch 3 times, most recently from 8310d77 to dc6a43b Compare May 25, 2026 04:58
@angela-ko angela-ko marked this pull request as ready for review May 25, 2026 04:58
Comment thread pyproject.toml
Comment thread tesseract_core/runtime/array_encoding.py Outdated
Comment thread tests/dummy_tesseract/tesseract_requirements.txt Outdated
Comment thread docs/content/using-tesseracts/array-encodings.md

@dionhaefner dionhaefner left a comment

Contributor

Taking shape – let's get some clarity on high-level design decisions before diving into details.

@angela-ko angela-ko marked this pull request as draft June 16, 2026 05:55
@angela-ko angela-ko marked this pull request as ready for review June 26, 2026 19:52
@angela-ko angela-ko requested review from dionhaefner and nmheim June 26, 2026 19:52
@dionhaefner dionhaefner changed the title from "feat: Add lz4 compression to array_encodings" to "feat: Add optional lz4 compression support for arrays passed via base64 or binref encoding" Jun 29, 2026
Comment on lines +144 to +146
### binref + lz4 compression

Set `TESSERACT_BINREF_COMPRESSION=lz4` to compress arrays in `.bin` files. Each array is compressed individually, preserving offset-based random access. The compressed size is embedded directly in the buffer path (`<file>:<offset>:<compressed_size>`).

Contributor

This now also applies to base64, correct?
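
The buffer path format from the quoted docs snippet (`<file>:<offset>:<compressed_size>`) can be illustrated with a rough sketch. The helper names `append_compressed` and `read_compressed` are hypothetical, and stdlib `zlib` stands in for lz4 so the example runs without extra packages; swapping in `lz4.frame.compress` and `lz4.frame.decompress` would be a one-line change in each helper.

```python
import os
import tempfile
import zlib  # stand-in codec for this sketch; the PR itself uses lz4


def append_compressed(bin_path: str, raw: bytes) -> str:
    """Compress one array's bytes, append them to the .bin file,
    and return a buffer path of the form <file>:<offset>:<compressed_size>."""
    blob = zlib.compress(raw)
    with open(bin_path, "ab") as f:
        f.seek(0, os.SEEK_END)  # make the append offset explicit
        offset = f.tell()
        f.write(blob)
    return f"{bin_path}:{offset}:{len(blob)}"


def read_compressed(buffer_path: str) -> bytes:
    """Read and decompress exactly one array using only its buffer path."""
    # Split from the right so colons inside the file name are preserved.
    file_name, offset, size = buffer_path.rsplit(":", 2)
    with open(file_name, "rb") as f:
        f.seek(int(offset))
        blob = f.read(int(size))
    return zlib.decompress(blob)


with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "arrays.bin")
    first = append_compressed(bin_path, b"\x00" * 4096)
    second = append_compressed(bin_path, bytes(range(256)) * 4)
    # Random access: the second array is decoded without reading the first.
    restored_second = read_compressed(second)
    restored_first = read_compressed(first)
```

Because each array is compressed on its own, the reader needs only the offset and compressed size to locate one blob, which is why the size has to travel with the path.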

Comment on lines +67 to +70
def _lz4_frame():
    import lz4.frame

    return lz4.frame

Contributor

Can live in global scope since the dep is now mandatory

output_path: str = "."
output_format: supported_format_type = "json"
output_file: str = ""
binref_compression: Literal["lz4"] | None = None

Contributor

Only binref?

 if array_encoding == "base64":
-    return _dump_base64_arraydict(arr)
+    return _dump_base64_arraydict(
+        arr, compression=context.get("base64_compression")

Contributor

I suggest we use a single use_compression variable instead of format-specific ones.
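
One way to read this suggestion, sketched under assumptions: a single `use_compression` entry in the serialization context is consulted by every encoding path, rather than separate `base64_compression` and `binref_compression` keys. The `dump_array` function, the `COMPRESSORS` registry, and the returned dict shapes are hypothetical, and `zlib` stands in for lz4 so the sketch has no third-party dependencies.

```python
import base64
import zlib
from typing import Callable, Optional

# Hypothetical registry mapping codec names to compress functions.
# zlib is a stand-in here; the PR's codec is lz4.
COMPRESSORS: dict[str, Callable[[bytes], bytes]] = {"zlib": zlib.compress}


def _maybe_compress(raw: bytes, compression: Optional[str]) -> bytes:
    """Apply the named codec, or pass the bytes through when it is None."""
    if compression is None:
        return raw
    if compression not in COMPRESSORS:
        raise ValueError(f"Unsupported compression: {compression!r}")
    return COMPRESSORS[compression](raw)


def dump_array(raw: bytes, array_encoding: str, context: dict) -> dict:
    """Encode raw array bytes; one shared context key drives compression."""
    compression = context.get("use_compression")
    data = _maybe_compress(raw, compression)
    if array_encoding == "base64":
        return {
            "encoding": "base64",
            "compression": compression,
            "buffer": base64.b64encode(data).decode("ascii"),
        }
    if array_encoding == "binref":
        # A real binref dump would write `data` to a .bin file;
        # this sketch only reports the number of bytes it would write.
        return {"encoding": "binref", "compression": compression, "size": len(data)}
    raise ValueError(f"Unsupported encoding: {array_encoding!r}")


ctx = {"use_compression": "zlib"}
raw = b"\x00" * 1024
b64 = dump_array(raw, "base64", ctx)
ref = dump_array(raw, "binref", ctx)
```

With one key, adding another encoding or codec touches a single code path, and callers cannot end up with mismatched per-format settings.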

@dionhaefner dionhaefner left a comment

Contributor

Thanks @angela-ko. Looking real good now, just a last few comments.


4 participants